Business Analytics, Data Science and Machine Learning Trends
Author
Anu Sharma, Cindy Guzman, Gavin Boss
Published
September 10, 2025
1 Overview
This analysis explores trends in Business Analytics, Data Science, and Machine Learning job postings by focusing on the skills required for these roles. We analyze how different skill combinations impact salary, remote work opportunities, and career paths.
2 Data Loading and Setup
Code
import pandas as pdimport numpy as npimport plotly.express as pximport plotly.graph_objects as gofrom plotly.subplots import make_subplotsimport plotly.io as pioimport jsonimport refrom collections import Counterpio.templates.default ="plotly_white"pio.renderers.default ="notebook"# Load data from csvdf = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")# print(df.head())# Skills related columns identified through manual inspection of schemaskills_columns = ['SKILLS', 'SKILLS_NAME', 'SPECIALIZED_SKILLS', 'SPECIALIZED_SKILLS_NAME','SOFTWARE_SKILLS', 'SOFTWARE_SKILLS_NAME', 'CERTIFICATIONS', 'CERTIFICATIONS_NAME']print("\nSkills-related columns identified through schema:")for col in skills_columns:if col in df.columns:print(f"- {col}")
We identified that ‘SKILLS_NAME, ’SOFTWARE_SKILLS_NAME’ and ‘SPECIALIZED_SKILLS_NAME’ columns are most relevant to the analysis.
3 Skills Data Preprocessing
Code
# Apply filtersdf_filtered = df.dropna(subset=['SALARY', 'TITLE'])# Convert salary to numeric and filterdf_filtered['SALARY'] = pd.to_numeric(df_filtered['SALARY'], errors='coerce')df_filtered = df_filtered[df_filtered['SALARY'] >0]print(f"Records after filtering: {len(df_filtered):,}")df_skills = df_filtered.copy()# Focus on key ML/Data Science skills. We identifief some key skills for# ML/DS roles manually.key_skills = ['Python (Programming Language)','R (Programming Language)','SQL (Programming Language)','Machine Learning','Data Science','Data Analysis','Statistics','Artificial Intelligence','TensorFlow','PyTorch (Machine Learning Library)','Pandas (Python Package)','NumPy (Python Package)','Scikit-Learn (Python Package)','Big Data','Apache Spark','Apache Hadoop','Amazon Web Services','Microsoft Azure','Google Cloud Platform (Gcp)','Data Visualization','Tableau (Business Intelligence Software)','Power BI','Natural Language Processing (NLP)','Computer Vision','Deep Learning' ]print(f"Using focused {len(key_skills)} ML/Data Science skills for analysis")# Create binary features for each key skill.for skill in key_skills:# Clean skill name for column naming# Eg: R (Programming Language) --> has_r_programming_language skill_col_name =f'has_{skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")}' df_skills[skill_col_name] = ( df_skills['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) | df_skills['SOFTWARE_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) | df_skills['SPECIALIZED_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) ).astype(int)print("Binary skill features created")# Create ML/DS role indicator using focused skillscore_ml_skills = ['has_machine_learning', 'has_artificial_intelligence', 'has_tensorflow', 'has_pytorch_machine_learning_library','has_deep_learning', 'has_natural_language_processing_nlp', 'has_computer_vision']core_ds_skills = ['has_python_programming_language', 'has_r_programming_language', 'has_statistics', 'has_data_analysis', 'has_big_data', 'has_data_visualization','has_data_science', 'has_pandas_python_package', 'has_numpy_python_package','has_scikit_learn_python_package']# Role indicators# ML roles are straighforward.df_skills['is_ml_role'] = ( (df_skills[core_ml_skills].sum(axis=1) >0)).astype(int)# R language is primarily associated with Data Science field. So,# if job requires R language or if it has more than one data science# skills then it is considered DS role.df_skills['is_ds_role'] = ( df_skills['has_r_programming_language'] ==1| (df_skills[core_ds_skills].sum(axis=1) >1)).astype(int)df_skills['is_ml_ds_role'] = ((df_skills['is_ml_role'] ==1) | (df_skills['is_ds_role'] ==1)).astype(int)# Remote work indicatordf_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)df_final = df_skills# # Check which skills actually exist in the dataframe# available_core_ml = [skill for skill in core_ml_skills if skill in df_skills.columns]# available_ds_skills = [skill for skill in data_science_skills if skill in df_skills.columns]# print(f"Available core ML skills ({len(available_core_ml)}): {available_core_ml}")# print(f"Available DS skills ({len(available_ds_skills)}): {available_ds_skills}")# # ML role if has core ML skills OR (Python + Statistics/Data Analysis)# if available_core_ml:# df_skills['has_core_ml'] = df_skills[available_core_ml].sum(axis=1) > 0# else:# df_skills['has_core_ml'] = False# if available_ds_skills and 'has_python_programming_language' in df_skills.columns:# # Check for Python + (Statistics OR Data Analysis OR Data Science)# stats_cols = [col for col in ['has_statistics', 'has_data_analysis', 'has_data_science'] if col in df_skills.columns]# if stats_cols:# df_skills['has_ds_combo'] = (df_skills['has_python_programming_language'] == 1) & (df_skills[stats_cols].sum(axis=1) > 0)# else:# df_skills['has_ds_combo'] = False# else:# df_skills['has_ds_combo'] = False# df_skills['is_ml_role'] = (df_skills['has_core_ml'] | df_skills['has_ds_combo']).astype(int)# # Create remote work indicator# df_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)# df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)# df_final = df_skillsprint(f"Final dataset size: {len(df_final):,}")print(f"ML/Data Science roles identified: {df_final['is_ml_ds_role'].sum():,}")
Records after filtering: 30,808
Using focused 25 ML/Data Science skills for analysis
Binary skill features created
Final dataset size: 30,808
ML/Data Science roles identified: 5,429
3.1 Heuristics employed for ML/DS roles
We identified ML and Data Science roles using specific technical skills listed in the dataset. ML roles need skills like TensorFlow, PyTorch, Deep Learning, etc. Data Science roles typically require R programming, Python with Statistics, or multiple data analysis tools.
The goal is to see how these specialized skills impact salary and career opportunities. We’ll use machine learning models to find patterns that can guide job seekers in choosing which skills to develop.
4 Feature Engineering for ML
Code
# Just prepare the modeling datasetmodeling_cols = ['SALARY', 'is_ml_ds_role', 'is_remote', 'experience_years'] +\ [col for col in df_final.columns if col.startswith('has_')]df_modeling = df_final[modeling_cols].copy()print("Features for modeling:")print(f"Dataset shape: {df_modeling.shape}")print(f"Columns: {list(df_modeling.columns)}")print(f"Missing values: {df_modeling.isnull().sum().sum()}")
The clustering analysis grouped jobs based on their skill requirements and characteristics. We found 6 distinct job clusters, each with different salary levels, remote work availability, and skill profiles.
Key Findings:
Cluster 4 shows higher concentrations of ML/DS roles with corresponding higher average salaries
Remote work availability varies significantly across clusters, suggesting that certain skill combinations are more compatible with remote positions.
Experience requirements differ by cluster, indicating distinct career progression paths.
Takeaways for Job Seekers:
The specialized ML/DS cluster (Cluster 4) offers good pay but fewer opportunities.
Target skills that appear in higher-paying clusters to maximize salary potential
If remote work is important, target skills common in Cluster(5).
6 Supervised Learning: Multiple Regression
Code
# Identify regression features.regression_features = skill_feature_cols + ['experience_years', 'is_remote', 'is_ml_ds_role']# Prepare regression data using salary as the target variableX_reg = df_modeling[regression_features].fillna(0)y_reg = df_modeling['SALARY']X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)print(f"Training set size: {len(X_train):,}")print(f"Test set size: {len(X_test):,}")# Scale featuresscaler_reg = StandardScaler()X_train_scaled = scaler_reg.fit_transform(X_train)X_test_scaled = scaler_reg.transform(X_test)# Multiple Linear Regressionlr = LinearRegression()lr.fit(X_train_scaled, y_train)# Random Forest Regressionrf_reg = RandomForestRegressor(n_estimators=100, random_state=42)rf_reg.fit(X_train_scaled, y_train)print("Skills based regression models training completed")
Training set size: 24,646
Test set size: 6,162
Skills based regression models training completed
Code
# Evaluate regression models# Linear Regression predictionsy_pred_lr = lr.predict(X_test_scaled)rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))r2_lr = r2_score(y_test, y_pred_lr)# Random Forest predictionsy_pred_rf = rf_reg.predict(X_test_scaled)rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))r2_rf = r2_score(y_test, y_pred_rf)print("Skills-based Regression Model Performance:")print(f"Linear Regression - RMSE: ${rmse_lr:,.2f}, R²: {r2_lr:.4f}")print(f"Random Forest - RMSE: ${rmse_rf:,.2f}, R²: {r2_rf:.4f}")# Feature importance for Random Forest# Only use features that actually exist in the modelactual_feature_names = [col for col in regression_features if col in X_train.columns]importances = rf_reg.feature_importances_# Visualize feature importancefig = px.bar(x=actual_feature_names, y=importances, title="Skills Impact on Salary (Random Forest Feature Importance)", labels={'x': 'Features', 'y': 'Importance'})fig.update_layout(template="plotly_white", xaxis_tickangle=-45)fig.show()# Top skills by salary impactskill_importance =list(zip(actual_feature_names, importances))skill_importance.sort(key=lambda x: x[1], reverse=True)print("\nTop skills by salary impact:")for skill, importance in skill_importance[:10]:print(f"{skill}: {importance:.4f}")
Skills-based Regression Model Performance:
Linear Regression - RMSE: $37,899.28, R²: 0.2780
Random Forest - RMSE: $32,514.17, R²: 0.4686
Top skills by salary impact:
experience_years: 0.4935
is_remote: 0.0721
has_data_analysis: 0.0427
has_tableau_business_intelligence_software: 0.0371
has_amazon_web_services: 0.0360
has_sql_programming_language: 0.0354
has_python_programming_language: 0.0297
has_statistics: 0.0288
has_big_data: 0.0256
has_data_science: 0.0256
6.1 Regression Analysis: What drives salary?
We built the prediction models to understand how skills influence salary. The Random Forest model achieved R2 of 0.47 compared to 0.28 for Linear Regression, showing that skill-salary relationships are complex.
Model Performance: Random Forest R2 of 0.07 means it explains 47% of salary variation. RMSE of 32514 shows typical prediction error It shows skill alone don’t fully determine salary - other factors matter too
Feature Importance Results: The chart shows which skills have the strongest impact on salary predictions. Top Salary Drivers(by importance): 1. Experience years(0.49) - the biggest factor, almost half of salary determination 2. Remote work(0.07) - remote positions tend to pay differently 3. Data Analysis skill(0.04) - core analytical capability 4. Tableau(0.04) - visualization and BI tool 5. AWS(0.04) - cloud computing platform 6. SQL(0.04) - database querying 7. Python(0.03) - programming language
Implications for Career Development: Experience matters - nearly 50% of salary impact comes from years of computer Remote works capability add salary premium Combination of skills drives salary differences Experience(years) and remote work flexibility influence compensation
7 Supervised Learning: Classification Using Skills (Random Forest Only)
Code
# Prepare features for classification.classification_features = skill_feature_cols + ['experience_years', 'is_remote']# Prepare classification dataX_clf = df_modeling[classification_features].fillna(0)y_clf = df_modeling['is_ml_ds_role']# Train/test split for classificationX_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)# Scale featuresscaler_clf = StandardScaler()X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)X_test_clf_scaled = scaler_clf.transform(X_test_clf)# Random Forest Classificationrf_clf = RandomForestClassifier(n_estimators=100, random_state=42)rf_clf.fit(X_train_clf_scaled, y_train_clf)print("Skills-based classification model trained successfully!")
Skills-based classification model trained successfully!
Code
# Random Forest predictionsy_pred_rf_clf = rf_clf.predict(X_test_clf_scaled)accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)f1_rf = f1_score(y_test_clf, y_pred_rf_clf)print("Skills based Classification Model Performance:")print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1 Score: {f1_rf:.4f}")# Confusion Matrix for Random Forestcm = confusion_matrix(y_test_clf, y_pred_rf_clf)# Visualize confusion matrixfig = px.imshow(cm, text_auto=True, aspect="auto", title="Confusion Matrix - ML/DS Role Classification", labels=dict(x="Predicted", y="Actual"), color_continuous_scale="Blues")fig.update_layout(template="plotly_white")fig.show()print("Classification Report:")print(classification_report(y_test_clf, y_pred_rf_clf))# Only use features that actually exist in the classification modelclf_actual_feature_names = [col for col in classification_features if col in X_train_clf.columns]clf_importances = rf_clf.feature_importances_# Visualize classification feature importancefig = px.bar(x=clf_actual_feature_names, y=clf_importances, title="Skills Impact on ML/Data Science Role Classification", labels={'x': 'Features', 'y': 'Importance'})fig.update_layout(template="plotly_white", xaxis_tickangle=-45)fig.show()
Skills based Classification Model Performance:
Random Forest - Accuracy: 0.9995, F1 Score: 0.9986
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 5082
1 1.00 1.00 1.00 1080
accuracy 1.00 6162
macro avg 1.00 1.00 1.00 6162
weighted avg 1.00 1.00 1.00 6162
The classification model predicts whether a job is an ML/Data Science role based on its skill requirements. The Random Forest Classifier achieved stong performance in distinguishing these specialized roles from analyst positions.
Model Performance Interpretation: -Accuracy shows the model correctly identified all roles -It suggests ML/DS roles have very distinct skill patterns compared to other data jobs -This high accuracy indicates our skill based criteria effectively separates these roles
Feature Importance This chart shows which skills are strongest predictors of ML/DS classification. Skills with higher bars are the “signature” skills that clearly distinguish ML/DS roles from analyst positions.
Actionable Insights - The perfect accuracy shows DL/MS roles require distinctly different skill sets - Focus on the top features to signal ML/DS capabilities to employers - These specialized skills are what separate advance roles from general analytics - Building expertise in high importance features directly increases ML/DS role readiness
8 Model Results Visualization
Code
# Visualization of model resultsfig = make_subplots( rows=1, cols=2, subplot_titles=('Model Performance Comparison', 'Skills vs Salary Impact'), specs=[[{"type": "bar"}, {"type": "bar"}]])# Model performance comparisonmodels = ['Linear Regression', 'Random Forest Regression', 'Random Forest Classification']metrics = [r2_lr, r2_rf, accuracy_rf]fig.add_trace(go.Bar(x=models, y=metrics, name="Performance"), row=1, col=1)# Skills vs salary impacttop_skills_salary = skill_importance[:8]fig.add_trace(go.Bar(x=[s[0] for s in top_skills_salary], y=[s[1] for s in top_skills_salary], name="Salary Impact"), row=1, col=2)fig.update_layout( height=450, showlegend=False, template="plotly_white", title={'text': "Core Model Results - ML/Data Science Skills Analysis",'y': 0.98,'x': 0.5,'xanchor': 'center','yanchor': 'top', }, margin=dict(t=80))fig.show()
9 Key Takeaways and Recommendations
9.1 Summary of Findings
Our analysis of business analytics, data science and machine learning job postings reveals several important patterns:
Skill-Based Job Segmentation: Jobs cluster into 6 distinct groups. Cluster 4 (pure ML/DS) pays $140K with only 77 positions, while Cluster 1 (10,189 jobs) pays $145K with mixed roles. Remote work availability varies from 25% to 56% across clusters.
Salary Drivers: Experience dominates (49% importance) followed by remote work capability (7%). Technical skills like Tableau, AWS, SQL and Python each contribute 3-4%. The R² of 0.47 shows skills explain about half of salary variation.
Role Differentiation: ML/DS roles have distinct skill patterns, achieving 100% classification accuracy. This indicates these specialized positions require clearly different capabilities than analyst roles.
9.2 Recommendations for Job Seekers
For Career Advancement: - Gain experience - it’s the single biggest salary driver (49% importance) - Develop remote work capabilities - adds 7% to salary potential - Learn practical tools: Tableau, AWS, SQL are each worth 3-4% salary impact - Not only ML/DS titles can be considered as Cluster 1 (non-ML) pays $145K vs Cluster 4 (pure ML) at $140K
For Transitioning to ML/Data Science: - The 100% classification accuracy shows these roles need very specific skill combinations - Focus on the specialized skills shown in the classification importance chart - Note: ML/DS specialization has fewer opportunities (Cluster 4 has only 77 jobs)
For Maximizing Opportunities: - For remote work: Target Cluster 5 skills (56% remote, 63% ML/DS roles) - For job volume: Cluster 1 has most opportunities (10,189 jobs) at highest pay ($145K) - For specialization: Cluster 4 is pure ML/DS but limited opportunities (77 jobs)
9.3 Limitations and Considerations
The analysis is based on job posting data which may not reflect actual hiring outcomes
Skill requirements in job posts may differ from day-to-day job responsibilities
Market conditions and geographic factors also influence salaries beyond just skills
The models identify patterns but don’t capture all nuances of career success